Abstract

For robots to be effective in human environments, they should be capable of successful task execution in unstructured
environments. Many task-oriented manipulation behaviors executed by robots rely on model-based grasping
strategies, and model-based strategies require accurate object detection and pose estimation. Both of these tasks are hard in
human environments, since such environments are plagued by partial observability and unknown objects. Given these
constraints, it becomes crucial for a robot to be able to operate effectively under partial observability in unrecognized
environments. Manipulation in such environments is also particularly hard, since the robot needs to reason about the
dynamics of how various objects of unknown or only partially known shape interact with each other under contact.
Modelling the dynamic process of a cluttered scene during manipulation is hard even if all object models and poses are
known. It becomes even harder to develop a reasonable process or observation model with only partial information
about the object class or shape. To enable a robot to operate effectively in partially observable unknown environments, we
introduce a policy learning framework where action selection is cast as a probabilistic classification problem on hypothesis
sets generated from observations of the environment. Online, the action classifier is operated with a global stopping
criterion for successful task completion. The example we consider is object search in clutter, where we assume
access to a visual object detector that directly populates the hypothesis set given the current observation. We
can thereby avoid the temporal modelling of the process of searching through clutter. We demonstrate our algorithm on two
manipulation-based object search scenarios: a modified minesweeper simulation and a real-world object search in clutter using
a dual-arm manipulation platform.
1 Introduction


For robots to manipulate in unknown and unstructured environments, they should be capable of operating
under partial observability of the environment. Object occlusions and unmodeled environments are some of the factors
that result in partial observability, which in turn causes uncertainty in the robot's state estimate. A common scenario
where this is encountered is manipulation in clutter. When the robot needs to locate an object of interest and
manipulate it, it must perform a series of decluttering actions to accurately detect the object of interest. To perform
such a series of actions, the robot also needs to account for the dynamics of objects in the environment and how they
react to contact. This is a nontrivial problem, since one needs to reason not only about robot-object interactions but also
about object-object interactions in the presence of contact. In the example scenario of manipulation in clutter, the state vector
would have to account for the pose of the object of interest and the structure of the surrounding environment. The
process model would have to account for all the aforementioned robot-object and object-object interactions. The complexity
of the process model grows exponentially as the number of objects in the scene increases, which is commonly the case in
unstructured environments. Hence it is not reasonable to attempt to model all object-object and robot-object interactions
explicitly.

In some cases of human decision making, we likewise observe that we do not reason over all the possible agent-object and
object-object interactions when manipulating in unstructured environments. For instance, imagine the case where you
are looking for your keys on a table among clutter. When sifting through the clutter, since we have an accurate model of
the object of interest, i.e. the keys, we only reason about a limited set of cases, such as the possibility of the keys being
occluded by another object. Under
this setting we can formulate the problem as one where we construct a set of hypotheses about the possible poses of the
object of interest given the current evidence in the scene, and select actions based on the current set of hypotheses. This
hypothesis set represents the belief about the structure of the environment and the number of poses the object
of interest can take. The uncertainty in the pose of the object of interest depends directly on the structure
of the environment, i.e. on the number of other known or unknown objects in the environment. The agent's only stopping
criterion is that the uncertainty regarding the pose of the object is fully resolved. The natural question is whether it is
possible to learn a search policy for such settings in real systems, and what constraints must be applied to
the problem setting to make learning tractable. A crucial factor to note is that, as the size of the environment grows, the size
of this hypothesis set also grows.

2 Problem Formulation 

Consider a robot that has access to a database of object models $O = \{O_1, \ldots, O_n\}$ and a set of actions $A$.
These actions could be movement primitives. Our task is to locate an object of interest $O_i \in O$ in a cluttered environment.
To accomplish this task, we need to execute a sequence of actions from $A$ to manipulate the environment so that we can accurately
detect $O_i$. For this problem we denote our current state vector as $X_t \in X$, which comprises the pose of $O_i$, represented
by $\mathcal{P}_t \in \mathcal{P}$. $\mathcal{P}_t$ is dictated by an object model and the current structure of the environment $\mathcal{E}_t \in \mathcal{E}$. $\mathcal{E}_t$ is a voxelized
representation where the occupancy of voxels is informed by the poses of all the other detected objects in the environment,
whose shapes are dictated by object models or shape primitives. Let $b$ denote the belief state, i.e. the distribution over
the state space $X$. Our objective is to learn a policy that will give us an action to execute given our current belief about
the state. In essence we want to learn a policy $\pi : b(X_t) \to A$, where $X_t = [\mathcal{P}_t; \mathcal{E}_t]$. To determine the optimal sequence of
actions to achieve our task, we can formulate the problem as a POMDP, where our optimal policy would be given by

$$\pi^* = \arg\max_{\pi} V^{\pi}(b(X_0))$$

where $b(X_0)$ is our initial belief. The optimal policy, denoted by $\pi^*$, yields the highest expected reward value for each
belief state, which is represented by an optimal value function $V^*$. This value function can be calculated as

$$V^*(b(X_t)) = \max_{a \in A} \left[ R(b(X_t), a) + \gamma \sum_{Z_t} O(Z_t \mid b(X_t), a) \, V^*(T(b(X_t), a, Z_t)) \right]$$

Here $\gamma$ is a discount factor and our reward is defined as

$$R(b(X_t), a) = \begin{cases} 1 & \text{if } a = a_{ter} \\ 0 & \text{otherwise} \end{cases}$$

An action $a = a_{ter}$ if the object of interest is successfully located. In this formulation we also assume access to an
observation model $O(Z_t \mid b(X_t), a)$ and a process model $T(b(X_t), a, Z_t)$, i.e. we can accurately predict the outcome of an
action. The process model in this formulation inherently assumes one of two things: either we can model the dynamics
of interactions between various rigid bodies in the environment, or we can model the evolution of the hypothesis set as an
outcome of actions executed. As mentioned earlier in Section 1, both of these tasks are nontrivial. Given the context of
our problem, it is not easy to model object-object and robot-object interactions or to model the change in the state uncertainty
as an outcome of physical interaction. A possible approach to modeling either of these phenomena would be to learn from
demonstrations or synthetic data. Even if we were to learn these distributions from demonstrations or synthetic data, the
number of samples required to reasonably approximate the state space would be exponential in the number of objects in
the environment. A similar argument can be made for the observation model. Also, the belief function $b(X_t)$ is hard to
estimate given a large state space, as it needs to account for the object pose $\mathcal{P}$ and the entire structure of the environment
$\mathcal{E}$. Hence, we constrain this general formulation.

We note that we can in principle filter the belief using Bayesian filtering to account for the entire history of observations
and actions. In our case, the belief function $b(\cdot)$ represents the distribution over the object poses and the structure of the
environment. Note that the object poses depend on the structure of the environment; hence modeling this uncertainty
is not straightforward. Instead of parameterizing the distribution of the state vector $X_t$, we adopt a non-parameterized
approach where we use a discrete set of hypotheses $\mathcal{H}_t = \{H_1, \ldots, H_m\}$ that can be constructed using the model of our
object of interest $O_i$ and the current state of the environment $\mathcal{E}_t$. The state of the environment at time $t$ is estimated from
the observation $Z_t$ given by a visual sensor. Given our current observation $Z_t$, we specify the belief $b(X_t)$ as the current
hypothesized object poses with respect to the visible environment, given by the set $\mathcal{H}_t = b(X_t)$. This hypothesis set is
constructed using tools from vision that take the object model $O_i$ and observation $Z_t$ and return $\mathcal{H}_t$ as a function of both. The
objective of the problem is to manipulate the environment until we have reduced the cardinality of our current hypothesis
set to 1, $|\mathcal{H}_t| = 1$, so that we can successfully execute a model-based manipulation action. We define this action
as a terminal action $a_{ter} \in A$ with reward 1. In an effort to make learning and inference in this setting tractable, we
approximate quantities that can be easily observed and modeled. Instead of trying to learn the dynamics of interactions
in the environment, we try to directly learn a mapping between the belief state $b(X_t)$ and actions $A$. This mapping is
learned with discriminative classifiers that return an action given the current belief state. To ensure that the state space
of the problem does not grow exponentially with the number of objects in the scene, we make the classifiers agnostic to
the complete state of the environment and instead have them classify actions based on features computed on the current
hypothesis set $\mathcal{H}_t$. We assume that we can construct the hypothesis set for any object model $O_i$ under any observation in
$Z$. Hence our policy learning problem is reduced to

$$\pi^* = \arg\max_{a} w^{T} f(b(X_t), a) \quad \text{where } b(X_t) = \mathcal{H}_t$$

Here, different policies can be learned and compared by altering either the features or the number of classes, i.e. actions.
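
The reduced formulation above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the linear score with a weight vector `weights`, the function names, and the toy feature are all assumptions made for illustration. The sketch only shows the two ingredients described in the text, namely a global stopping criterion (a single remaining hypothesis triggers the terminal action) and an argmax over actions of a score computed from features of the hypothesis set.

```python
from typing import Callable, Hashable, List, Sequence, Set

Hypothesis = Hashable
Action = str


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Plain inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def select_action(
    hypotheses: Set[Hypothesis],
    actions: Sequence[Action],
    features: Callable[[Set[Hypothesis], Action], List[float]],
    weights: Sequence[float],
    terminal_action: Action,
) -> Action:
    """Return the terminal action once one hypothesis remains,
    otherwise the action maximizing the (assumed) linear score."""
    if len(hypotheses) == 1:  # stopping criterion: cardinality of the set is 1
        return terminal_action
    return max(actions, key=lambda a: dot(weights, features(hypotheses, a)))


# Toy usage (made-up values): hypotheses are 1-D positions and the feature
# counts the hypotheses lying on the side that an action moves towards.
def toy_features(hypotheses, action):
    left = sum(1 for h in hypotheses if h < 5)
    right = len(hypotheses) - left
    return [float(left if action == "left" else right)]


print(select_action({1, 2, 7}, ["left", "right"], toy_features, [1.0], "grasp"))
print(select_action({7}, ["left", "right"], toy_features, [1.0], "grasp"))
```

In the toy usage the first call selects `left`, because two of the three hypotheses lie on that side, and the second call returns the terminal action `grasp`, because only one hypothesis remains. A learned classifier would replace the hand-set weights.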

3 Modified Minesweeper Simulation 

We emulate the problem of action selection under partial observability using a modified minesweeper scenario. In our
modified minesweeper scenario, the mines are organized into a fixed-size H-structure in the grid. The objective of the
game is to accurately determine the pose of this hidden H-structure by opening a minimum number of non-mine cells.
As in the classical minesweeper scenario, an opened cell is either numbered or empty, indicating the number of mines in its
8-connected neighbourhood, or it is a mine, in which case the game terminates. The agent selects actions based on its
current hypothesis set. This set is constructed based on the current observation, i.e. the opened cells and their values.
The game is completed when the agent has narrowed down its set of hypotheses to one. The set of actions
available to the agent is to open a cell from the 8-connected neighbourhood of the current open cell. The game play is
initialized randomly.
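
As an illustration of how such a hypothesis set could be constructed from the opened cells, the following Python sketch enumerates every placement of the H-structure and keeps those consistent with the observed numbers. It rests on assumptions not stated in the text: the H-structure is taken to be a 3x3 footprint, a pose is a translation only (the top-left corner), and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]

# Assumed 3x3 "H" footprint as (row, col) offsets; the actual size is not given.
H_SHAPE: List[Cell] = [(0, 0), (1, 0), (2, 0), (1, 1), (0, 2), (1, 2), (2, 2)]


def mines_for_pose(pose: Cell) -> Set[Cell]:
    """Cells occupied by mines when the H-structure's top-left corner is at pose."""
    r0, c0 = pose
    return {(r0 + dr, c0 + dc) for dr, dc in H_SHAPE}


def neighbour_mines(cell: Cell, mines: Set[Cell]) -> int:
    """Number of mines in the 8-connected neighbourhood of cell."""
    r, c = cell
    return sum(
        (r + dr, c + dc) in mines
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def hypothesis_set(rows: int, cols: int, opened: Dict[Cell, int]) -> List[Cell]:
    """All poses whose mines agree with every opened (non-mine) cell's number."""
    hypotheses = []
    for r0 in range(rows - 2):
        for c0 in range(cols - 2):
            mines = mines_for_pose((r0, c0))
            if all(
                cell not in mines and neighbour_mines(cell, mines) == n
                for cell, n in opened.items()
            ):
                hypotheses.append((r0, c0))
    return hypotheses


# Toy usage on a 6x6 grid with the hidden structure at pose (2, 2).
true_mines = mines_for_pose((2, 2))
opened = {(0, 0): neighbour_mines((0, 0), true_mines)}
print(len(hypothesis_set(6, 6, {})), len(hypothesis_set(6, 6, opened)))
```

On this toy grid, opening the empty corner cell (0, 0) shrinks the hypothesis set from 16 poses to 12. The game would end once the returned list has length one.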

A demonstration of this game play environment is shown in Figure 1, where Figure 1(a) is the actual game play
environment, Figure 1(b) is the ground truth location of the hidden H-structure and Figure 1(c) shows the features
computed on the current hypothesis set. The feature we use is an inverse distance transform, where cells close to the
current set of hypotheses get a high score and cells far away from the hypothesis set get a low score. We then extract
local templates from the features computed on the hypothesis set. These templates are 3x3 patches around the current
expert location. The class corresponding to the feature is the location of the next action selected by the expert in the
8-connected neighbourhood. The evolution of the hypothesis set corresponding to the current game environment is
demonstrated in Figure 2. We train the agent with demonstrations from an expert, where the expert plays the game over
a number of trials.
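
The feature and template extraction described above can be sketched in Python as follows. The exact form of the inverse distance transform is not specified in the text, so the score 1 / (1 + d), with d the Euclidean distance to the nearest hypothesis cell, is an assumption. Zero padding at the grid border and the function names are likewise illustrative choices.

```python
import math
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]


def inverse_distance_feature(
    rows: int, cols: int, hypothesis_cells: Iterable[Cell]
) -> List[List[float]]:
    """Score each cell by 1 / (1 + distance to the nearest hypothesis cell)."""
    cells = list(hypothesis_cells)
    feat = [[0.0] * cols for _ in range(rows)]
    if not cells:
        return feat
    for r in range(rows):
        for c in range(cols):
            d = min(math.hypot(r - hr, c - hc) for hr, hc in cells)
            feat[r][c] = 1.0 / (1.0 + d)
    return feat


def local_template(
    feat: List[List[float]], centre: Cell, pad: float = 0.0
) -> List[float]:
    """Flattened 3x3 patch around centre, padded outside the grid."""
    rows, cols = len(feat), len(feat[0])
    r0, c0 = centre
    return [
        feat[r][c] if 0 <= r < rows and 0 <= c < cols else pad
        for r in range(r0 - 1, r0 + 2)
        for c in range(c0 - 1, c0 + 2)
    ]


# Toy usage: one hypothesis cell in the corner of a 4x4 grid.
feature_map = inverse_distance_feature(4, 4, [(0, 0)])
template = local_template(feature_map, (0, 0))
print(template)
```

Each training example would then pair such a 9-element template with the index of the neighbouring cell that the expert opened next.
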
Figure 1: Modified Minesweeper

Figure 2: Hypothesis Set and Game Environment Updates

We compare different agents against a heuristic player (HP). The agents trained were a multiclass (MC) one-vs-all SVM
trained on the local templates, with the cells of the 8-connected neighbourhood as classes; a binary agent (BE) that
classifies a local template from anywhere on the grid as actionable or not; and a binary 8-connected (B8) agent that
applies the binary agent to the 8-connected grid. We tested the various agents over 100 different trials with 10 random
poses of the hidden H-structure, where each of the 10 poses had 10 different initializations for the agent. The results are
tabulated in Table 1. The results show the mean number of actions taken over the successful trials of the 10 trials. The
number of failed attempts in these 10 trials is boldfaced (the value after the semicolon). Failures result from opening a
mine or, in the B8 case, from failing to classify any neighbouring cell as actionable. The best result for each random pose
is highlighted in green.

Table 1: Results of Minesweeper Tests

Agent  Trial 1  Trial 2  Trial 3  Trial 4  Trial 5  Trial 6  Trial 7  Trial 8  Trial 9  Trial 10
MC     12.3     8.4;2    8.1      10.1     7.1;1    [?]      7.1;1    10.1     12.6     13.2;2
BE     13.8     11.3     13.2     16       10.6     15.6     20.8     11.6     12.1     14.5
B8     12.3;3   8.4;2    9.1;3    10.1;2   9;4      17.8;4   8.3;3    10.1;2   12.6;5   13.2;3
HP     25.6     9.3      8;1      6.4      7.8;1    13.8     8.3      [?]      28.5     21

4 Transition to a Real Robot Environment

We apply the same policy learning framework to a real robot decluttering experiment, where the robot is tasked with
locating an object of interest in a cluttered environment. Here the input observation $Z_t$ is an RGB-D point cloud. The
hypothesis set $\mathcal{H}_t$ of the object of interest is computed using the output of an object classifier that returns an object
class and pose hypothesis for every point cloud cluster in the environment. These hypotheses are then projected onto
a planar support surface (tabletop) to compute a hypothesis feature similar to the one in the minesweeper scenario. The
general pipeline is demonstrated in Figures 3 and 4 below.
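
The projection of pose hypotheses onto the planar support surface can be sketched in Python as below. This is a simplified illustration and not the pipeline used on the robot. It assumes that hypothesis positions are already expressed in a table-aligned frame with the z axis normal to the tabletop, so projection amounts to dropping z and binning x and y. The grid origin, the resolution and the function name are hypothetical.

```python
import math
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]
Point = Tuple[float, float, float]


def project_to_grid(
    points_xyz: Iterable[Point],
    origin_xy: Tuple[float, float],
    resolution: float,
    rows: int,
    cols: int,
) -> Set[Cell]:
    """Drop z and bin (x, y) into (row, col) cells of a tabletop grid."""
    cells = set()
    for x, y, _z in points_xyz:
        col = math.floor((x - origin_xy[0]) / resolution)
        row = math.floor((y - origin_xy[1]) / resolution)
        if 0 <= row < rows and 0 <= col < cols:  # ignore points off the table
            cells.add((row, col))
    return cells


# Toy usage: two hypotheses above the same cell and one outside the grid.
hypothesis_positions = [(0.12, 0.31, 0.05), (0.13, 0.33, 0.20), (2.0, 0.1, 0.0)]
occupied = project_to_grid(hypothesis_positions, (0.0, 0.0), 0.1, rows=5, cols=5)
print(occupied)
```

The resulting set of cells plays the same role as the hypothesis cells in the minesweeper grid, so a grid feature such as the inverse distance transform could be computed on it in the same way.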



Figure 3: Point cloud preprocessing. (a) Input Point Cloud, (b) Pointcloud Clustering, (c) Preprocessing Overlay, (d) VP-Tree Classifier

Figure 4: Hypothesis Feature Computation. (a) Projected hypothesis, (b) Hypothesis Overlay, (c) Env Occupancy Grid, (d) Inverse Dist Transform

5 Conclusions and Future Work


We have demonstrated a policy learning approach for hypothesis-based action selection. Our approach is trained in a
supervised manner with expert demonstrations. The key feature of our approach is that we can accomplish complex tasks
without reasoning about a process or observation model. Our approach also has the ability to scale to large environments,
and the learning complexity is agnostic to the size of the environment. Our proposed model simplification approach is
only valid for the class of POMDP problems where states are strictly Markovian in nature (e.g. [10]), i.e. where the current
observation encompasses the history of all previous observations. In the future we are going to perform more tests on our
robotic setup and apply this framework to other policy learning tasks.